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Abstract  -  This  paper  presents  a  design  for  parallel  processing 
of  synthetic  aperture  radar  (SAR)  data  using  multiple  Graphics 
Processing  Units  (GPUs).  Our  approach  supports  real-time 
reconstruction  of  a  two-dimensional  image  from  a  matrix  of  echo 
pulses  and  their  response  values.  Key  to  runtime  efficiency  is  a 
partitioning  scheme  that  divides  the  output  image  into  tiles  and 
the  input  matrix  into  a  collection  of  pulses  associated  with  each 
tile.  Each  image  tile  and  its  associated  pulse  set  are  distributed  to 
thread  blocks  across  multiple  GPUs,  which  support  parallel 
computation  with  near-optimal  I/O  cost.  The  partial  results  are 
subsequently  combined  by  a  host  CPU.  Further  efficiency  is 
realized  by  the  GPU’s  low-latency  thread  scheduling,  which 
masks  memory  access  latencies.  Performance  analysis  quantifies 
runtime  as  a  function  of  input/output  parameters  and  number  of 
GPUs.  Experimental  results  were  generated  with  10  nVidia  Tesla 
C2050  GPUs  having  maximum  throughput  of  972  Gflop/s.  Our 
approach  scales  well  for  output  (reconstructed)  image  sizes  from 
2,048  x  2,048  pixels  to  8,192  x  8,192  pixels. 

Index  Terms — High  performance  computing,  Synthetic 
aperture  radar.  Image  reconstruction,  Graphics  processing  unit 

I.  Introduction 

The  use  of  radio-frequency  electromagnetic  radiation  for 
object  detection  in  a  synthetic  aperture  radar  (SAR) 
configuration  comprises  a  key  technique  for  surveillance 
imaging  [1,2].  In  practice,  an  electromagnetic  pulse  from  an 
emitter  location  pP  is  reflected  by  a  target,  and  the  reflected 
pulse  components  are  sensed  at  a  receiver  location  pp  and 
quantized  temporally  into  NB  bins.  Given  NP  pulses,  where 
each  pulse  is  taken  at  different  sensor  configurations  of  form 
(pP,  pP),  an  NP  X  N/s  element  pulse  data  matrix  P  is  obtained. 

Computational  transformation  of  P  into  an  /Vx/V-pixel 
image  b  that  depicts  targets  within  the  sensor’s  field-of-view 
(FOV)  requires  superposition  of  pulse  effects,  implying  an 
additive  process  where  each  pulse  and  its  range  bins  in  P 
potentially  influences  each  pixel  in  b.  The  globality  of  this 
approach,  which  uses  multiple  pulses  from  different  emitter 
locations,  helps  resolve  ambiguities  resulting  from  the  fact  that 
a  pulse  at  given  time  delay  references  multiple  points  in  the 
target  plane  that  are  equidistant  from  the  emitter  [3]. 
Advantageously,  since  the  SAR  reconstruction  concept  is 
similar  in  structure  (but  not  identical  mathematically)  to  a 
tensor  product,  one  can  adapt  various  distributed  processing 
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strategies  developed  for  the  tensor  product  [4]  to  implement 
SAR  image  reconstruction  on  a  cluster  of  GPUs. 

In  this  paper,  we  present  techniques  for  parallelizing  SAR 
image  reconstruction  algorithms  to  run  efficiently  on  a  CPU- 
controlled  cluster  of  multiple  graphics  processing  units 
(GPUs).  Our  technique  is  illustrated  in  terms  of  a  single-stage 
backprojection  algorithm  whose  computer  code  and  data  are 
available  publicly  [5,6].  Performance  of  our  approach  varies 
quasilinearly  with  NPR,  and  does  not  appreciably  compromise 
the  numerical  accuracy  of  the  reconstructed  image  b  with 
respect  to  a  reference  image  a.  Our  technique  employs 
algorithm-to-architecture  mapping  methods  that  feature  a  mix 
of  distributed-  and  shared-memory  models  that  reflect  key 
GPU  organizational  constraints. 

This  paper  is  organized  as  follows.  Section  II  provides 
theoretical  and  practical  background.  Section  III  presents  the 
parallelized  computer  codes  that  comprise  our  approach. 
Timing  and  error  measurements  and  analysis  are  given  in 
Section  IV,  with  conclusions  and  suggestions  for  future  work 
given  in  Section  V. 

II.  Theoretical  and  Practical  Background 

We  begin  with  a  discussion  of  the  single-stage  SAR 
backprojection  algorithm  (Section  II. A)  and  then  discuss 
previous  work  in  parallelizing  backprojection  algorithms 
(Section  II. B).  An  overview  of  the  CPU  and  GPU 
architectures  employed  (Section  II. C)  provides  practical 
background  for  the  subsequent  discussion  of  algorithm-to- 
architecture  mapping  in  Section  III. 

A.  Single-Stage  SAR  Backprojection  Algorithm 

Several  published  algorithms  for  SAR  reconstruction  from 
sensed  radar  pulse  data  have  been  optimized  for  sequential 
systems  with  relatively  low  throughput  [2].  Here,  algorithm 
design  seems  to  be  influenced  by  computational  performance, 
rather  than  quality  of  the  reconstructed  image.  In  contrast,  a 
naive  implementation  of  the  single-stage  backprojection 
algorithm  (BP1)  [5],  is  computationally  costly  but  yields  high 
numerical  quality  in  the  reconstructed  image. 

In  practice,  BP1  examines  the  pairing  of  every  received 
(postprocessed)  pulse  with  every  reconstructed  pixel  to 
estimate  object  reflectivity  at  each  point  in  the  spatial 
representation.  BP  1  inputs  pulse  response  matrix  P ,  an  NP  X  NB 
element  matrix  of  NP  pulses  and  NB  response  bins  obtained  by 
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temporally  quantizing  the  received  reflections  of  each  radar 
pulse  (Section  I).  For  a  given  pulse,  responses  sensed  later  in 
time  are  physically  further  from  emitter  location  p,..  thusly  are 
placed  into  higher-numbered  range  bins.  For  each  pixel  in  the 
output  image,  every  pulse  and  range  bin  in  that  pulse  is 
considered  separately.  The  value  in  a  given  range  bin  is  added 
to  the  value  of  the  associated  pixel  of  the  reconstructed  SAR 
image.  Brighter  areas  in  the  reconstructed  image  are 
associated  with  greater  target  reflectivity. 

Implementationally,  we  observe  that  for  matrices  A  and  B, 
the  tensor  product  C  =  A®  B  pairs  each  element  ay  of  /)  with  a 
sub-matrix  b  in  B  to  form  the  partial  product  Cy  =  ay  *  b,  where 
(*)  denotes  pointwise  multiplication.  In  practice,  a  host 
processor  marshals  products  Cy  to  form  the  matrix  C. 

We  exploit  this  tensor  product  structure  for  SAR  image 
reconstruction  (albeit  with  different  mathematical  details)  by 
associating  each  pulse  and  bin  in  P  with  each  pixel  of  b.  In 
particular,  let  P[/].bin[/]  denote  the  /h  range  bin  of  pulse  i  after 
phase  correction,  and  let  (P[/J  .x,  P\i\y,  P[/].z)  denote  the 
source  of  pulse  i  with  respect  to  a  prespecified  origin.  If  each 
pulse  has  range  [Rslart,  Rend\,  then  the  intensity  of  output  image 
b  corresponding  to  physical  location  ( x,y )  can  be  written  as: 

J(-y  -  rfb-f  +  iy  -  41-y)2  +  (^['1^  (1) 

(Re„d-Rstart)/\P['lbin\ 

Note  that  this  model  captures  the  data  movement  patterns 
of  BP1,  although  our  implementation  includes  several 
enhancements  that  improve  output  quality.  For  instance,  in 
Equation  (1),  it  was  assumed  that  b (x,y)  maps  to  a  range  bin 
indexed  by  an  integer.  In  practice,  reconstructed  pixels  are 
associated  with  temporal  intervals  that  can  overlap  range  bins, 
and  can  be  thought  of  as  mapping  to  bins  that  have  a  fractional 
index.  In  such  cases,  an  interpolated  (smoother)  image  can  be 
generated  by  examining  range  bins  adjacent  to  a  fractionally- 
indexed  bin,  then  computing  their  weighted  average  based  on 
their  distance  from  this  bin.  However,  such  enhancements  are 
not  the  focus  of  this  paper;  we  view  any  implementation  of  the 
backprojection  algorithm  with  a  data  access  pattern  as  shown 
in  Equation  1  as  being  logically  equivalent.  For  additional 
details,  the  reader  is  referred  to  [2]. 

B.  Related  Work 

The  approach  presented  herein  implements  fast  parallel 
processing  of  SAR  data  as  well  as  reconstruction  of  a  two- 
dimensional  SAR  image.  Additionally,  our  techniques  could 
be  applied  to  a  domain  where  sensor  response  data  is  projected 
back  to  form  an  image  of  a  surface  or  object.  For  example, 
computer-aided  tomography,  featuring  the  construction  of  a  3- 
D  model  from  a  set  of  2-D  cross-sectional  views,  as  in  medical 
imaging  for  obtaining  a  three  dimensional  view  of  tissue  in 
vivo.  Here,  each  volumetric  unit  or  voxel  has  properties  similar 
to  a  reconstructed  SAR  pixel,  and  the  tensor-product-like 
structure  holds.  Namely,  each  voxel  in  each  cross-sectional 
view  is  associated  with  a  response  value  that  is  spatially  near 
the  response  of  its  neighboring  voxel.  Similarly,  a  given 
volume  of  output  data  will  require  a  predictable  quantity  of 


sensor  response  data  in  each  cross  section.  The  preservation  of 
these  properties  ensures  that  our  partitioning  scheme  provides 
efficient  data  access  mechanisms  for  generating  tomograms 
[7,8]. 

Other  potential  applications  differ  from  the  preceding 
applications  involving  image  sampling.  For  example,  network 
tomography  infers  network  characteristics  from  observations 
taken  at  known  locations.  In  this  case,  each  node  in  the 
network  core  is  analogous  to  a  pixel  in  the  SAR  output  image, 
and  each  network  location  is  conceptually  similar  to  a  radar 
pulse.  Observe  that  the  latency  or  reliability  of  a  channel  can 
be  inferred  from  the  repeated  transfer  of  packets  between 
locations.  The  projection  of  this  data  back  onto  the  nodes 
through  which  packets  have  travelled  can  produce  a  visual 
representation  of  the  network  at  any  time,  without  requiring 
explicit  assistance  from  core  nodes.  In  this  paradigm,  spatial 
locality  follows  from  the  route  optimization  properties  of 
routing  protocols  [3,9]. 

A  similar  problem  involves  estimation  of  ocean 
temperatures  from  acoustic  wave  propagation  latencies,  as  the 
speed  of  sound  in  water  varies  inversely  with  water 
temperature.  By  measuring  acoustic  wave  propagation  times 
between  multiple  emitters  and  receivers,  a  set  of  cross 
sectional  images  can  be  constructed  analogous  to  cross 
sections  in  computer-aided  tomography.  Small  three 
dimensional  units  of  water,  not  unlike  the  voxels  of  computer 
aided  tomography,  become  the  unit  onto  which  acoustic  sensor 
response  data  is  projected.  Spatial  locality  of  input  data 
follows  from  the  output  partitioning  scheme  [10]. 

C.  Graphics  Processing  Unit  (GPU) 

GPU  devices  primarily  support  fast,  photorealistic  image 
rendering  for  video  gaming  applications.  GPUs  have  recently 
been  applied  to  supercomputing,  due  to  fast  SIMD  processing 
and  the  ability  of  some  GPUs  to  compute  double-precision 
floating  point  arithmetic  [11].  Additionally,  GPUs  have  a  huge 
installed  base  that  amortizes  significant  research  and 
development  expenditures,  thereby  achieving  relatively  low 
unit  cost  (<100  USD)  with  frequent  upgrades. 

Unfortunately,  GPUs  feature  a  hierarchical  memory 
structure  that  is  understandably  designed  for  fast  graphics 
operations,  but  does  not  necessarily  support  all  scientific 
computations.  This  particularly  holds  for  image  and  signal 
processing  with  large  operands,  and  for  grid-  and  mesh-based 
simulation  using  huge  model  domains.  Typical  GPU  multi¬ 
level  memory  structure  combines  a  shared  memory  model 
(DRAM  or  global  memory  and  L2  cache  or  shared  memory  in 
Fig.  1)  in  the  context  of  groupings  of  cores  (processing 
elements)  called  multiprocessors  or  streaming  multiprocessors 
having  local  memory  (LI  cache  in  Fig.  1).  As  a  result  of  the 
varying  latencies  between  DRAM,  L2  and  LI,  as  well  as  small 
local  memories,  data  movement  across  a  GPU  is  fraught  with 
challenges  that  feature  memory  access  latencies  which  can  be 
quite  large  for  apparently  simple  operations. 

As  a  result,  the  porting  of  legacy  code  to  GPUs  has 
emphasized  manual  optimization,  although  recent  reports 
indicate  some  automation  is  possible  [12].  In  practice, 
application-to-architecture  mapping  can  be  supported  by  an 
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underlying  code-oriented  architecture.  For  example,  the 
Nvidia  Tesla  C2050  GPU  used  in  this  study  has  14  streaming 
multiprocessors  (SMs),  each  containing  32  processing 
elements  or  cores  -  totaling  448  cores  per  device.  To 
coordinate  many  cores,  GPUs  arrange  threads  into  thread 
blocks.  In  particular,  each  thread  block  embodies  the  execution 
of  a  single  code  fragment  (or  kernel )  across  multiple  parallel 
threads  that  execute  on  a  single  SM.  So,  at  least  14  thread 
blocks  are  needed  to  engage  all  14  SMs  in  a  C2050  GPU  and 
each  thread  block  must  comprise  at  least  32  threads  to  engage 
the  32  cores  in  an  SM.  In  practice,  to  effectively  hide  memory 
latency,  may  more  thread  blocks  and  many  more  threads  per 
block  are  often  needed. 


Fig.  1 .  High-level  view  of  Nvidia  Tesla  C2050  GPU  architecture. 

During  program  design,  a  programmer  specifies  the  parts 
of  an  application  that  are  suited  for  threaded  (parallel) 
execution,  by  labeling  them  as  kernels  using  syntax  specified 
by  the  GPU  programming  language  (e.g.,  CUDA  for  an 
Nvidia  Tesla  processor).  At  runtime,  the  threads  of  a  thread 
block  can  access  a  common  shared  memory,  thereby 
supporting  a  form  of  inter-thread  communication.  A  much 
slower  global  memory,  accessible  from  all  thread  blocks,  helps 
transfer  data  to  and  from  the  host  as  well  as  to  and  from  the 
shared  memory  of  the  SMs  [11].  This  global  memory  also 
facilitates  inter-block  and  inter-SM  communication.  As  in 
other  hierarchically-configured  memory  systems,  high 
performance  implies  emphasizing  local  (intra-SM)  memory 
operations,  with  minimal  data  transfer  to  and  from  slower 
memory  devices. 

GPUs  can  operate  efficiently  with  thousands  of  threads 
running  simultaneously,  as  thread  scheduling  is  implemented 
directly  in  hardware  with  negligible  overhead  compared  to 
program  execution  time.  Support  for  many  threads  allows 
masking  of  memory  access  latencies:  other  threads  can  operate 
while  paused  threads  wait  for  I/O  operations  to  complete. 

The  BP1  algorithm  is  an  ideal  candidate  for  GPU 
implementation,  since  (a)  each  output  pixel  can  be  viewed  as 
the  sum  of  the  contribution  of  all  input  pulses,  and  (b)  the  set 
of  operations  used  to  calculate  this  contribution  is  not 


dependent  on  the  value  of  the  input.  However,  some  clever 
data  partitioning  is  required  to  realize  efficiency.  Efficient 
single  GPU  implementations  of  BP1  were  described  by  us  in 
[3,13]. 


III.  Multi  GPU  Algorithm  Design  and  Implementation 

The  SAR  backprojection  algorithm  (BP1)  was 
implemented  via  partitioning  of  the  pulse  matrix  P  and 
reconstructed  image  b,  as  discussed  in  Section  III. A.  Code 
structure  is  discussed  in  Section  III.B. 


A.  Algorithm-to-Architecture  Mapping  Technique 

In  order  to  restrict  the  pulse  data  in  P  to  fit  in  GPU  local 
memory,  but  not  to  be  so  small  to  cause  excessive  latency  due 
to  repetitive  memory  accesses,  we  employed  a  projection 
function  /  that  relates  the  sensing  geometry  for  a  given 
response  P(ij),  specific  to  the  ;th  pulse  and  /h  range  bin,  to  the 
corresponding  pixel  coordinate(s)  in  b  [3].  Let  P  have  Np  X  Nb 
domain  Y,  and  let  b  have  Nx  X  Nv  domain  X.  The  forward 
projection  function  f :  Y  — >  2X  forms  a  basis  for  derivation  of 
a  reverse  projection  function  g  :  2X  — >  2Y.  The  function  g  is 
useful  for  our  BPlmg  multi  GPU  implementation,  because  it 
accepts  the  coordinates  of  a  reconstruction  neighborhood  (also 
called  a  tile)  denoted  by  T  c  X,  and  produces  the 
neighborhood  ^k(Y)  which  denotes  the  coordinates  of  pulses 
corresponding  to  the  restriction  b|r.  Due  to  sensing  geometry 
at  typical  altitudes,  we  have  |^T(Y)|  «  bi(Y)[»  where  |S| 
denotes  cardinality  of  set  S,  and  pk  denotes  projection  to  the  kth 
coordinate. 

In  particular,  in  this  study  we  specified  a  Ax //-pixel  tile  T 
to  which  we  applied  g,  thereby  supporting  the  following 
restriction  of  the  pulse  matrix: 


P'  =  PU  (2) 

which  was  then  assigned  to  a  thread  block  for  processing  by  a 
streaming  multiprocessor  to  yield  b|r.  This  partial  result  was 
sent  to  the  CPU  for  accumulation  into  reconstructed  image  b. 
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Fig.  2.  Tile-  and  thread-block-based  mapping  technique  for  multi-GPU 
computation  of  algorithm  BPlmg  shown  in  Figure  6  (from  [3]). 


Given  this  approach  (see  Fig.  2),  efficiency  is  realized  by 
(a)  minimizing  |  P'  |  without  compromising  reconstruction 
accuracy,  (b)  balancing  K,  Nx,  and  Ny  to  minimize  I/O  and 
memory  latencies  associated  with  the  tiling  of  b,  and  (c) 
designing  an  efficient  reverse  projection  g  which  can  be  done 
with  (i)  table  lookup  or  (ii)  explicit  coordinate  computations. 
The  Nvidia  Tesla  C2050  GPU  that  was  employed  in  this 
study,  being  optimized  for  graphics,  is  well  suited  to  either  i) 
or  ii).  In  this  paper,  due  to  space  limitations,  we  focus  on 


techniques  a)  and  b).  The  reader  is  referred  to  [3,13]  for  a 
discussion  of  the  technique  in  method  c). 

B.  Code  Examples 

Parallelization  of  BP1,  which  was  specified  in  Matlab™  , 
as  overviewed  in  Fig.  3),  was  first  expressed  in  C  as  outlined 
in  Fig.  4,  then  in  the  CUDA  language,  as  outlined  at  a  high 
level  in  Fig.  5  (single-GPU  version)  and  Fig.  6  (multi-GPU 
cluster  version). 


L.l 

II  Read  all  pulse  data  into  memory 

L.2 

for  each  pulse  in  the  Npuises  dataset 

L.3 

II  Perform  Inverse  FFT  (ifft) 

L.4 

II  Perform  FFT  Shift 

L.5 

for  each  pixel  in  the  Nx  x  Ny  output  image 

L.6 

II  Perform  image  reconstruction 

L.7 

end 

L.8 

end 

L.9 

//Write  image  to  file 

Fig.  3. 

Outline  of  Matlab  code  for  BP1  backprojection  algorithm. 

L.l 

For  each  set  of  pulses  /"from  the  dataset  P 

L.2 

II  Read  in  a  the  next  pulse  set/"  serially, 

using  one  thread 

L.3 

II  Perform  Inverse  FFT  and  Shift  operations  on 

this  set  of  pulses  in  parallel 

L.4 

#pragma  parallel  for  each  pixel  in  the  Nx  x  Ny 

output  image  b 

L.5 

II  Perform  image  reconstruction 

L.6 

end 

L.7 

end 

Fig.  4.  Outline  of  C  code  for  BP1  parallel  backprojection  algorithm.  Note 
the  use  of  the  MPI  #  p  r  a  gma  parallel  construct  for  the  distributed 
memory  model  illustrated  in  Fig.  1. 


Given  the  code  for  BPlmg  outlined  in  Fig.  6,  which  uses 
the  mapping  approach  portrayed  notionally  in  Fig.  2,  we 
measured  kernel  runtime  and  numerical  error  associated  with 
the  serial  and  parallel  BP1  codes,  for  a  multicore  CPU 
controlling  up  to  10  Nvidia  Tesla  C2050  GPUs,  as  follows. 


L.l  II  Read  all  pulse  data  into  host  CPU  memory 

L.2  II  Write  all  pulse  data  onto  GPU  device  memory 

L.3  II  Perform  Inverse  FFT  for  all  Pulse  data  on  the  GPU 

L.4  II  Perform  FFT  Shift  for  all  Pulse  data  on  the  GPU 

L.5  Call  kernel  to  perform  reconstruction 

L.6  for  each  pulse  in  the  Npulses  dataset 

L.7  II  Perform  reconstruction  of  four  pixels  per 

CUDA  thread 

L.8  end 

L.9  end  of  kernel 

L.10  II  Read  reconstructed  image  from  GPU  device 
L.ll  //Write  image  to  file 


Fig.  5.  Outline  of  single  GPU  code  for  BPlsg  backprojection  algorithm. 


IV.  Experimental  Results  and  Discussion 

The  multi-GPU  version  of  backprojection  algorithm  BP1, 
hereafter  called  BPlmg,  was  compiled  with  the  C/MPI 
compiler  gcc  version  4.1.2  Revision  20080704  (Red  Hat  4.1.2- 
52)  and  the  CUDA  compiler  Release  4.0,  V0. 2. 1221,  on  five 
remote  nodes  of  the  Condor  architecture  [14],  where  each 
remote  node  has  two  Nvidia  C2050  GPUs  (see  Fig.  2).  One 
thread  was  used  on  each  of  six  CPUs:  one  on  the  localhost 
(which  had  no  GPU)  and  five  on  the  remote  nodes.  The 
localhost  and  each  remote  node  had  an  Intel  Xeon  6-core  CPU 
X5650  running  at  2.67GHz  with  hyperthreading.  The  GPU 
code,  BPlmg,  is  structured  as  shown  in  Fig.  6. 


L.l  II  Master  thread  sends  arguments  to  slave  threads  (STs) 
L.2  II  Master  thread  reads  data  file  and  broadcasts  it  to  STs 
L.3  II  Each  slave  thread  is  assigned  a  subset  P'  of  the  total 
number  of  pulses  for  each  device. 

L.4  II  Each  device  performs  Inverse  FFT  and  Shift  on  P' 
L.5  II  Each  device  performs  image  reconstruction  given 
the  pulse  subset/" 

L.6  for  each  set  P'  of  pulses  from  the  dataset  P 
L.7  for  each  pixel  assigned  to  thread  in  the  Nx  x  Ny 
construct  image  tile  b|r 

L.8  II  Perform  reconstruction  on  the  tile 

L.9  end 

L.10  end 

L.ll  II  Each  slave  thread  merges  the  reconstructed  image 
given  the  reconstructed  tiles  from  each  device 
L.12  II  Master  thread  merges  the  reconstructed  tiles  from 
each  slave  thread  to  yield  reconstructed  image  b 
L.13  II  Master  Thread  writes  the  reconstructed  imaae  to  file 


Fig.  6.  Outline  of  multi-GPU  code  BPlmg  for  backprojection  algorithm. 

A  hand-optimized  version  of  BPlmg  was  tested  using  the 
Tesla  C2050’s  double  precision  floating  point  arithmetic.  The 
input  data  (pulse  data,  antenna  locations,  etc.)  as  well  as  the 
reconstructed  image  are  single  precision  floating  point;  all 
operations  in  the  reconstruction  kernel  are  performed  in 
double  precision.  Pulses  before  (respectively,  after)  the 
Inverse  Fast  Fourier  Transform  (IFFT)  are  5,592  pulses  (resp. 
16,384  pulses). 

A.  Runtime  Measurement  and  Analysis 

Timing  data  for  BPlsg  and  BPlmg  are  presented  in  Tables 
I  (total  runtime)  and  II  (kernel  runtime),  with  the  same 
parameters  as  for  the  reference  single-GPU  algorithm  BPlsg 
similar  to  that  described  in  [3].  Total  execution  time  (sec, 
including  I/O,  message  passing,  memory  allocation  and 

TABLE  I.  Total  Runtime  in  Seconds  for  Single-  and  Multi-GPU 
Versions  of  the  BP1  Backprojection  Algorithm. 


Runtime 

Algorithm  Version 

Nx  x  Nr 

BPlsg 

BPlmg  /max 

Speedup 

2048x2048 

15.357 

2.494 

6.158  x 

4096x4096 

48.080 

6.468 

7.434  x 

8192x8192 

178.806 

22.039 

8.113  x 

transfer,  and  IFFT/Shift)  for  double-precision  GPU  cluster 
version  BPlmg  (labeled  “Multi”)  of  the  single-stage 
backprojection  algorithm  ( Nputses  =  5,004).  Speedup  is 
computed  from  the  maximum  runtime  for  all  cases  of  BPlmg, 
with  respect  to  the  single-GPU  version  BPlsg.  Timing  data 
for  total  run  time  are  summarized  graphically  in  Fig.  7. 

As  shown  in  Table  II,  the  kernel  speedup  for  image  sizes 
ranging  from  4Kx4K  to  8Kx8K  effectively  equals  the 
expected  theoretical  value  of  lOx.  Slight  differences  in 
speedup  (e.g.,  10.019x  instead  of  lOx)  result  from  systematic 
errors  due  to  limited  temporal  resolution  of  the  tic. ..toe  timer 
mechanism. 

TABLE  II.  Kernel  Runtime  in  Seconds  for  Single-  and  Multi-GPU 
Versions  of  the  BP1  Backprojection  Algorithm. 


Runtime 

Version  /  Number  of  GPUs 

NyXNy 

BPlsg 

BPlmg/1 

BPlmg/2 

BPlmg/3 

BPlmg/4 

BPlmg/5 

2048x2048 

11.39 

1.133 

1.132 

1.132 

1.231 

1.232 

4096x4096 

44.05 

4.333 

4.384 

4.387 

4.383 

4.415 

8192x8192 

173.79 

17.051 

17.247 

17.254 

17.255 

17.246 

Runtime 

Version  /  Number  of  GPUs  (cont’d) 

NxxNv 

BPlmg/6 

BPlmg/7 

BPlmg/8 

BPlmg/9 

BPlmg/10 

Speedup 

2048x2048 

1.198 

1.200 

1.201 

1.199 

1.152 

9.243X 

4096x4096 

4.384 

4.386 

4.387 

4.399 

4.415 

9.978X 

8192x8192 

17.346 

17.334 

17.334 

17.335 

17,346 

10.019X 

SingleGPU 
--•-Multiple  GPU 


Output  Image  Size,  pixels 


Fig.  7.  Total  Execution  Time,  BPlmg  algorithm  running  in  double 
precision  floating  point  arithmetic  on  one  to  five  Condor  nodes  with  MPI. 


B.  Reconstruction  Error  Measurement  and  Analysis 

In  the  following  discussion,  we  refer  to  average  error  and 
maximum  error,  which  are  computed  as  follows.  Assume  that 
a  reference  image  a  is  an  Nx  X  Ny  pixel  array,  which  is  used  as 
a  basis  for  comparing  the  reconstructed  image  b,  also  an  Nx  X 
Nr  pixel  array,  by  computing  a  difference  image  d  as: 

d (if)  =  c(if)  -  a (if),  1  <  i  <  Nx  and  1  <j  <  Ny  .  (3) 

The  maximum  error  and  average  error  are  defined  from  d, 
as  follows: 

emax  =  V,;  d (ij)  /  range  (4) 

and 

eave  =  Zij  d (ij)  /  ( range  •  Nx  •  Ny) ,  (5) 

where  range  denotes  the  range  of  pixel  values  in  b. 


In  tests  using  Equations  3-5,  we  found  that  the  average  and 
maximum  output  errors  were  identical  to  the  output  error  for 
the  double-precision  single-GPU  algorithm  BPlsg  outlined  in 
Fig.  5.  This  is  reasonable,  since  BPlmg  is  mathematically  the 
same  as  BPlmg,  but  runs  on  multiple  GPUs  instead  of  a  single 
GPU  device. 

We  have  also  found  that  measured  error  figures  depend  on 
the  version  of  the  Tesla  C2050  architecture  (e.g.,  SM10  or 
SM20),  as  well  as  the  trigonometric  functions  employed.  In 
particular,  sincosf  and  _sincosf  refer  to  a  combined 
trigonometric  function  that  computes  in  single  precision,  while 
sincos  refers  to  a  combined  trigonometric  function  that 
computes  in  double  precision. 

For  GPU  version  SM10  operating  in  single  precision  mode, 
the  maximum  error  (relative  to  double  precision  version  SM20 
using  sincos )  for  the  4Kx4K  output  image  was  4.42  /  255  (or 
2.8  percent),  and  5.45  /  255  (or  3.45  percent)  for  the  8Kx8K 
image.  However,  for  both  images  the  average  error  was  the 
same  (0.0511  /  255  or  0.03  percent).  We  have  found  that  the 
maximum  error  and  average  error  were  unaffected  by  the 
precision  of  the  trigonometric  functions. 

In  the  case  of  GPU  version  SM20  operating  in  single 
precision  mode,  the  maximum  and  average  errors  were 
similarly  unaffected  by  the  precision  of  trigonometric 
functions.  For  the  4Kx4K  image,  the  max  error  was  2.32  /  255 
or  1.65  percent,  while  for  the  8Kx8K  image,  the  maximum 
error  was  2.60  /  255  or  1.65  percent.  The  average  error  was 
0.046  /  255  or  0.03  percent  for  both  image  sizes.  While  the 
maximum  error  was  approximately  half  that  obtained  from 
SM10  single  precision  computation,  the  average  error  was 
reduced  by  only  10  percent. 

For  GPU  version  SM10  operating  in  double  precision 
mode,  double  datatypes  are  demoted  to  float  datatypes. 
Although  trigonometric  precision  does  not  affect  the 
maximum  and  average  errors,  these  are  not  the  same  as  when 
SM10  single  precision  is  used  -  but  they  are  quite  close. 
When  GPU  version  SM20  is  run  in  double  precision  mode, 
sincos  had  zero  error,  because  all  errors  were  measured  with 

respect  to  this  run.  Sincosf  and _ sincosf  produced  maximum 

error  of  0.00302  (effectively  zero  percent)  for  the  4Kx4K 
reconstructed  image,  and  0.00339  (effectively  zero  percent) 
for  the  8Kx8K  image.  The  average  error  was  0.0000585  and 
0.0000584  (as  before,  effectively  zero  percent). 

In  the  case  of  GPU  version  SM10  operating  in  hybrid 
precision  mode,  the  maximum  and  average  errors  were  the 
same  as  for  SM10  operating  in  double  precision  mode.  For 
GPU  Version  SM20  operating  in  hybrid  precision  mode,  the 
maximum  and  average  errors  were  the  same  for  sincosf,  as 
well  as  for  the  case  when  these  trigonometric  functions  run  in 
SM20  double  precision  mode.  However,  when  sincos  was 
used  in  SM20  hybrid  mode,  the  maximum  and  average  errors 
were  approximately  10"4  and  10'7,  respectively,  compared  to 
the  “gold  standard”  SM20  double  precision  mode  with  sincos. 
As  noted  previously,  SM10  in  double  precision  mode 
produced  the  same  errors  as  SM20  double  precision  mode. 


V.  Conclusions 

Synthetic  aperture  radar  image  reconstruction  via 
backprojection  is  an  instance  of  a  class  of  problems  that  are 
based  structurally  on  the  tensor  product.  With  some  clever 
manipulation,  backprojection  algorithms  can  be  adapted  to  a 
parallelization  strategy  based  on  tiling  of  the  output  data 
structure  (in  this  case,  a  reconstructed  SAR  image).  However, 
one  must  also  have  a  projection  function  that  associates  a 
computationally  useful  subset  of  the  input  data  structure  (in  this 
case  a  SAR  pulse  array)  with  each  output  tile. 

In  this  paper,  we  present  a  tiling  algorithm  for  SAR  image 
reconstruction  from  thousands  of  pulses  or  views.  This 
algorithms,  called  BP1,  is  adapted  for  implementation  on  a 
single  GPU  (algorithm  BPlsg )  and  a  multi-GPU  cluster 
(BPlmg)  controlled  by  a  multi-core  CPU.  As  theory  predicts, 
execution  time  scales  inversely  with  the  number  of  GPU  slave 
nodes,  reaching  a  speedup  of  lOx  for  10  Nvidia  Tesla  C2050 
GPUs  on  the  Condor  architecture.  Average  image 
reconstruction  error  is  within  +0.05  percent  of  grayscale  range, 
and  is  negligible  in  a  large  portion  of  the  reconstruction 
scenarios. 

We  also  present  ideas  for  future  work,  in  which  the  image 
reconstruction  algorithm  presented  herein  could  be  applied  to 
problems  in  computed  tomography,  fluid  dynamics,  and  other 
imaging  or  simulation  scenarios  based  on  multiple  views. 
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